Add Static FP8 KV Support #737
Conversation
Pull Request Overview
This PR adds FP8 KV cache quantization support to the AutoRound quantization library. This feature enables quantizing the key-value cache in attention mechanisms to FP8 format for improved memory efficiency.
- Adds a new `enable_fp8_kv` parameter to control FP8 KV cache quantization
- Implements FP8 KV cache infrastructure with calibration and quantization context
- Updates test coverage to verify FP8 KV cache serialization and functionality
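For illustration, a minimal usage sketch of the new flag, assuming it is exposed on the `AutoRound` entry point as the overview describes. The placeholder model name, output path, and the exact placement of `enable_fp8_kv` are assumptions; the flag's name and location are revisited later in this conversation.

```python
# Minimal sketch, not the PR's actual example: enable_fp8_kv is the new flag
# described above, passed here at construction time for illustration only.
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "facebook/opt-125m"  # placeholder model for illustration
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    enable_fp8_kv=True,  # new flag from this PR (placement discussed below)
)
autoround.quantize()
autoround.save_quantized("./opt-125m-w4-fp8kv")
```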
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
File | Description |
---|---|
auto_round/autoround.py | Adds enable_fp8_kv parameter and context manager integration |
auto_round/experimental/fp8_kv_cache.py | Core FP8 KV cache implementation with quantization logic |
test/test_cpu/test_export.py | Test updates to verify FP8 KV cache export functionality |
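For readers unfamiliar with static FP8 KV caching, the core idea behind the quantization logic in `auto_round/experimental/fp8_kv_cache.py` is to calibrate a fixed per-tensor scale for keys and values and then cast them to `float8_e4m3fn`. The sketch below is illustrative only; the helper names, scale granularity, and calibration flow are assumptions and are not taken from this PR's code.

```python
# Illustrative sketch of static (per-tensor) FP8 KV quantization; the function
# names here are hypothetical and not copied from fp8_kv_cache.py.
import torch

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0


def calibrate_kv_scale(observed_absmax: torch.Tensor) -> torch.Tensor:
    """Derive a static scale from the max |k| or |v| seen during calibration."""
    return (observed_absmax / FP8_E4M3_MAX).clamp(min=1e-12)


def quantize_kv(x: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize a key/value tensor to FP8 with a fixed (static) scale."""
    return (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)


def dequantize_kv(x_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Recover a higher-precision tensor for the attention computation."""
    return x_fp8.to(torch.float16) * scale


# Example: calibrate on one batch of keys, then quantize and dequantize.
k = torch.randn(2, 8, 128, 64, dtype=torch.float16)
k_scale = calibrate_kv_scale(k.abs().amax().float())
k_fp8 = quantize_kv(k.float(), k_scale)
k_ref = dequantize_kv(k_fp8, k_scale)
```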
Please note that the KV cache does not affect tuning, since we only use the forward pass during tuning. So should the first change be moving the arg from __init__ to save_quantized?
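In other words, the suggestion amounts to the difference between the two call styles below. This is a hypothetical sketch reusing the model and tokenizer from the earlier example; neither line is copied from this PR, and the final placement is decided later in the thread.

```python
# Hypothetical sketch of the two placements under discussion.
from auto_round import AutoRound

# Option A: flag passed at construction time (tuning is unaffected either way).
ar = AutoRound(model, tokenizer, bits=4, enable_fp8_kv=True)
ar.quantize()
ar.save_quantized("./opt-125m-w4-fp8kv")

# Option B: flag passed only at export time, treating it as a serialization concern.
ar = AutoRound(model, tokenizer, bits=4)
ar.quantize()
ar.save_quantized("./opt-125m-w4-fp8kv", enable_fp8_kv=True)
```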
Since the other non-tuning args, such as …
Due to the large number of arguments in __init__, only the arguments likely to be used are explicitly listed, while the others are hidden in **kwargs. I assume the static KV cache is not frequently used, so we can keep it in **kwargs for now.
Agreed, updated.
The code could be improved by providing another arg, like static_kv_dtype, to replace the existing enable_static_kv and support more dtypes: fp4, int4/int8, etc.
Since only FP8 KV is supported on the vLLM end, I'd prefer to keep it as is in this PR. I'm open to adding support for more data types if needed.
OK, then how about changing the arg now to make the API more backward compatible in case we need to support other dtypes in the future? Only bf16/fp16/fp8 would be supported for now; for 16-bit dtypes, nothing needs to be done in the code.
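A rough sketch of what such an argument could look like. The helper and the handling of each value are assumptions based on this discussion, not the merged code; only the bf16/fp16/fp8 set mentioned above is handled, and 16-bit dtypes are a no-op.

```python
# Hypothetical validation/dispatch for a static_kv_dtype argument.
from typing import Optional

_SUPPORTED_STATIC_KV_DTYPES = {"bf16", "fp16", "fp8"}


def resolve_static_kv_dtype(static_kv_dtype: Optional[str]) -> Optional[str]:
    """Normalize the requested static KV-cache dtype.

    Returns None when no KV-cache quantization is needed (default, bf16, fp16)
    and the dtype string when the KV cache should be quantized (fp8 for now).
    """
    if static_kv_dtype is None:
        return None
    dtype = static_kv_dtype.lower()
    if dtype not in _SUPPORTED_STATIC_KV_DTYPES:
        raise ValueError(
            f"Unsupported static_kv_dtype {static_kv_dtype!r}; "
            f"expected one of {sorted(_SUPPORTED_STATIC_KV_DTYPES)}"
        )
    # 16-bit dtypes require no changes in the quantization flow.
    if dtype in ("bf16", "fp16"):
        return None
    return dtype  # "fp8": enable the FP8 KV-cache path
```

Other dtypes (fp4, int8, ...) could later be added to the supported set and dispatched to their own paths without changing the public API.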
Sure, updated.